[codex] Fix Mamba conv params under fine-grained FSDP gather #4467
ilml wants to merge 3 commits into NVIDIA:main
Conversation
CC'd from DMs: Hmm, so the MambaMixer has a Conv1D submodule, but the Conv1D submodule's weights are used in autograd functions during the MambaMixer.forward() pass?
MambaLayer (GraphableMegatronModule) was not recognized as an FSDP sharding unit, causing its parameters to remain in the root group and defeating ZeRO-3 param sharding for Mamba and hybrid models. Additionally, MambaMixer sets tensor_model_parallel and partition_dim directly on parameters (conv1d, A_log, dt_bias, D, norm.weight) rather than on the owning module. The TP annotation logic only checked module-level attributes, so these parameters were either unclassified or misclassified by the norm-name fallback (e.g. ExtendedRMSNorm treated as replicated when actually TP-sharded).

Changes:
- Register MambaLayer in default fsdp_unit_modules (mcore_fsdp_adapter) and sub_modules_to_wrap (torch_fully_sharded_data_parallel)
- Add param-level TP attribute fallback in _detect_parallelism_type, placed before the norm-name fallback so TP-sharded norm weights are correctly classified
- Pass param through from _annotate_tensor_parallelism
- Add tests for param-level TP detection, norm override, and a MambaMixer-like end-to-end annotation test

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
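A minimal sketch of the param-level fallback described above, assuming the tensor_model_parallel attribute names that Megatron stamps onto parameters; the real _detect_parallelism_type in Megatron-FSDP has more cases and different return labels:

```python
from typing import Optional

import torch.nn as nn


def _detect_parallelism_type(module: nn.Module, param: Optional[nn.Parameter] = None) -> str:
    # 1) Module-level annotation (existing behavior): TP-aware modules such as
    #    ColumnParallelLinear carry TP attributes on the module itself.
    if getattr(module, "tensor_model_parallel", False):
        return "tensor_parallel"
    # 2) Param-level fallback (the fix): MambaMixer stamps tensor_model_parallel /
    #    partition_dim directly on parameters (conv1d, A_log, dt_bias, D,
    #    norm.weight). Checked BEFORE the norm-name fallback so a TP-sharded
    #    ExtendedRMSNorm weight is not misclassified as replicated.
    if param is not None and getattr(param, "tensor_model_parallel", False):
        return "tensor_parallel"
    # 3) Norm-name fallback (existing behavior): plain norms are replicated.
    if "norm" in type(module).__name__.lower():
        return "replicated"
    return "data_parallel"
```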
Mamba's fused path reads conv1d weights directly instead of calling Conv1d.forward(), so fine-grained Megatron-FSDP never gathered those child parameters before the second forward. Register the conv module as an extra forward-gather source and resolve context-parallel Mamba params from the live mixer object.
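For intuition, a self-contained sketch of the failure mode using plain PyTorch hooks (no Megatron-FSDP involved): fine-grained gather runs in a per-module pre-forward hook, and a hook on a child module only fires when that child's forward is called, so a fused parent that reads the child's weight directly bypasses it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv1d(4, 4, 3, padding=2, groups=4)
fired = []
# Fine-grained FSDP gathers params in a module pre-forward hook like this one.
conv.register_forward_pre_hook(lambda module, args: fired.append("gathered"))

x = torch.randn(1, 4, 8)
conv(x)                       # hook fires: params would be gathered here
assert fired == ["gathered"]

# Fused-style direct read of the child's params, as in Mamba's fused path:
F.conv1d(x, conv.weight, conv.bias, padding=2, groups=4)
assert fired == ["gathered"]  # hook did NOT fire again: no gather happened
```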
Also, if we do implement this, I think Torch FSDP2 has something similar for "hitch-hiking" parameters into the all-gather. That's per-parameter / per-tensor though, so we're basically doing the same thing here with modules, I suppose.
cspades
left a comment
Looks pretty good; per-module parameter hitch-hiking seems reasonable to me since our FSDP units are based on modules as well.
@shjwudp @Autumn1998 @wujingyue in the rewrite we should expose this as an API, but this PR defines the feature. Is this general enough?
-    return self._slice_conv_param(self.conv1d_cp1.weight)
+    conv1d = self._mixer.conv1d if self._mixer is not None else self.conv1d_cp1
+    return self._slice_conv_param(conv1d.weight)
Have to ask, where/when do we have "stale" references such that we can no longer directly retrieve the weight?
wujingyue
left a comment
Thanks for the fix!
IIUC, [the linked code] is where def conv1d calls F.conv1d instead of a submodule.
Instead, could MambaContextParallel's constructor create the submodule from conv1d_cp1? For your reference, https://gitlab-master.nvidia.com/clara-discovery/boltz/-/blob/dev/src/boltz/distributed/model/layers/triangular_attention.py#L1490 is an internal example that adopts this practice for context parallelism.
cc @cspades
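For illustration, a hedged sketch of that alternative: build a real Conv1d submodule around conv1d_cp1's parameters so the conv call goes through Module.forward() and per-module FSDP hooks fire. conv1d_cp1 is from the PR; the wrapper class and wiring below are hypothetical.

```python
import torch.nn as nn


class MambaContextParallelSketch(nn.Module):
    """Hypothetical sketch of the suggestion above: own a real Conv1d submodule
    instead of calling F.conv1d on raw conv1d_cp1 weights."""

    def __init__(self, conv1d_cp1: nn.Conv1d):
        super().__init__()
        conv = nn.Conv1d(
            conv1d_cp1.in_channels,
            conv1d_cp1.out_channels,
            conv1d_cp1.kernel_size[0],
            padding=conv1d_cp1.padding,
            groups=conv1d_cp1.groups,
            bias=conv1d_cp1.bias is not None,
        )
        # Share the original Parameter objects (no copy), so the optimizer and
        # FSDP keep a single source of truth for these weights.
        conv.weight = conv1d_cp1.weight
        if conv1d_cp1.bias is not None:
            conv.bias = conv1d_cp1.bias
        self.conv1d = conv

    def forward(self, x):
        # Going through the submodule means FSDP's pre-forward gather hook on
        # self.conv1d runs, unlike the direct F.conv1d(x, weight, ...) path.
        return self.conv1d(x)
```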
```python
# Opt-in hitch-hiking: a module can declare extra child modules whose
# parameters must be gathered before this module's forward, because the
# forward reads those params directly without calling the child module.
extra_forward_param_modules = getattr(module, "_fsdp_extra_forward_param_modules", ())
if isinstance(extra_forward_param_modules, nn.Module):
    extra_forward_param_modules = (extra_forward_param_modules,)
if extra_forward_param_modules:
    # Dedup by identity so params already owned by this module are not
    # appended to the gather list twice.
    seen_param_ids = {id(param) for param in param_list}
    for extra_module in extra_forward_param_modules:
        for extra_param in extra_module.parameters():
            if id(extra_param) not in seen_param_ids:
                param_list.append(extra_param)
                seen_param_ids.add(id(extra_param))
```
There is also a post-forward / post-backward hook that calls this function: release_module_parameters. It needs to be called on the modules in extra_forward_param_modules so we can re-shard them.
Early note: what about the pre-backward param unshard?
We need to ensure that any rogue weights are re-sharded, and un-sharded during the backward pass. Did you check that the Conv-1D weights are re-sharded?
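To make that concern concrete, a hedged sketch of the release-side symmetry. release_module_parameters is the function named in the comment above; the wrapper and its hook wiring are assumptions, not the PR's actual code.

```python
import torch.nn as nn


def release_with_extra_modules(module: nn.Module, release_module_parameters) -> None:
    # Mirror the gather-side hitch-hiking: after forward/backward, re-shard
    # not only this module's own params but also those of any extra modules
    # it declared via _fsdp_extra_forward_param_modules.
    release_module_parameters(module)
    extras = getattr(module, "_fsdp_extra_forward_param_modules", ())
    if isinstance(extras, nn.Module):
        extras = (extras,)
    for extra_module in extras:
        release_module_parameters(extra_module)
```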
cspades
left a comment
Still WIP, will approve when all features are implemented!
wujingyue
left a comment
Request for clarification: did you run into this issue with enable_fine_grained_param_gather_hook? When MFSDP is applied to MambaLayer, it ought to find all parameters under that layer including the ones in conv1d. It wasn't clear to me how MFSDP missed the parameter in the first place. cc @ilml
It's automatically turned on for MXFP8. (I still forget exactly why.)
yeah, this is a very subtle bug:
Similar to what I said earlier, I think the right fix is to make [...]. The reasons are: [...]
Wdyt?

Summary
Fix fine-grained Megatron-FSDP parameter gathering for Mamba's fused conv path.
This branch is now stacked with the MambaLayer FSDP support from #4329 first, then the direct conv-param gather fix. #4329 makes MambaLayer an FSDP unit and fixes TP annotation for SSM parameters; this PR handles the remaining case where MambaMixer.forward() reads a child module's parameters without invoking that child module's forward hook.

Mamba's memory-efficient fused path reads conv1d.weight and conv1d.bias directly and passes them into mamba_split_conv1d_scan_combined, instead of calling Conv1d.forward(). With fine-grained Megatron-FSDP gather enabled, that means the child conv1d module's pre-forward gather hook never runs. After the first forward releases parameter storage, the second forward can pass a null-base sharded view into causal-conv, producing an illegal memory access.

This PR adds an opt-in _fsdp_extra_forward_param_modules hook for modules that directly read child-module params, and uses it from MambaMixer for self.conv1d. It also makes Mamba context-parallel parameter access resolve through the live mixer object so FSDP-updated parameters are not bypassed by stale cached references.
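A hedged sketch of how the opt-in is meant to be used, with a toy mixer standing in for MambaMixer. The _fsdp_extra_forward_param_modules attribute name is from this PR; everything else below is illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F


class ToyFusedMixer(nn.Module):
    """Toy stand-in for MambaMixer's fused path: consumes conv1d's params
    directly, so conv1d.forward() (and its FSDP gather hook) never runs."""

    def __init__(self, channels: int, kernel_size: int = 4):
        super().__init__()
        self.conv1d = nn.Conv1d(
            channels, channels, kernel_size, groups=channels, padding=kernel_size - 1
        )
        # Opt in: ask fine-grained FSDP to also gather conv1d's params in this
        # module's pre-forward hook, since the fused kernel reads them directly.
        self._fsdp_extra_forward_param_modules = (self.conv1d,)

    def forward(self, x):
        # Fused-style direct read of child params (stand-in for
        # mamba_split_conv1d_scan_combined).
        y = F.conv1d(
            x,
            self.conv1d.weight,
            self.conv1d.bias,
            padding=self.conv1d.padding[0],
            groups=self.conv1d.groups,
        )
        return y[..., : x.shape[-1]]  # causal truncation
```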
Validation

- python3 -m py_compile megatron/core/distributed/fsdp/mcore_fsdp_adapter.py megatron/core/distributed/fsdp/src/megatron_fsdp/megatron_fsdp.py megatron/core/distributed/torch_fully_sharded_data_parallel.py megatron/core/ssm/mamba_context_parallel.py megatron/core/ssm/mamba_mixer.py tests/unit_tests/distributed/megatron_fsdp/test_mcore_tensor_parallelism_detect.py
- python3 -m pytest tests/unit_tests/distributed/megatron_fsdp/test_mcore_tensor_parallelism_detect.py -q could not run on the login node because pytest is not installed there.
- /home/tolong/work/dsv3/interactive_nt.sh completed through iteration 50 and saved a checkpoint with no CUDA error, illegal memory, or FAILED in the log: /lustre/fsw/coreai_dlalgo_llm/tolong/results/nemo_megatron/megatron/nemotron6/hybrid/debug/interactive_nt_debug/logs/interactive_nt_full_20260424_150942.log